Guided section

library(tidyverse)
library(plotly)

1. Read data

data <- read_csv('./gapminder_clean.csv') %>%
    select(-1) %>%
    rename(
       co2em = `CO2 emissions (metric tons per capita)`,
       popden = `Population density (people per sq. km of land area)`,
       lifeExp = `Life expectancy at birth, total (years)`,
       energy = `Energy use (kg of oil equivalent per capita)`,
    )

2. Scatter plot of CO2 emissions and GDP in 1962

data1962 <- data %>%
    filter(Year == 1962) %>%
    select(gdpPercap, co2em) %>%
    drop_na()
ggplot(data = data1962) +
    geom_point(mapping = aes(
        x = gdpPercap,
        y = co2em)) +
    labs(x = "GDP per capita", y = "CO2 emissions per capita (metric tons)")

3. Pearson correlation of CO2 emissions and GDP

cor.test(data1962 %>% pull(gdpPercap), data1962 %>% pull(co2em))
## 
##  Pearson's product-moment correlation
## 
## data:  data1962 %>% pull(gdpPercap) and data1962 %>% pull(co2em)
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8934697 0.9489792
## sample estimates:
##       cor 
## 0.9260817

4. Year of strongest correlation between CO2 emissions and GDP

corrs <- data %>%
    group_by(Year) %>%
    select(Year, gdpPercap, co2em) %>%
    drop_na() %>%
    summarise(correlation = cor(gdpPercap, co2em))
maxi <- lapply(corrs, max)

The strongest correlation is 0.9387918 in the year 2007.

5. Interactive scatter plot of CO2 emissions and GDP

max_em_year_data <- data %>%
    filter(Year == maxi$Year) %>%
    select(gdpPercap, co2em, pop, continent, `Country Name`) %>%
    drop_na()
fig <- ggplot(data = max_em_year_data) +
    geom_point(aes(
        x = gdpPercap,
        y = co2em,
        size = pop,
        color = continent,
        text = paste("Country: ", `Country Name`,
             "\nGDP: ", gdpPercap,
             "\nCO2 emissions: ", co2em))) +
    xlab("GDP per capita") +
    ylab("CO2 emissions per capita (metric tons)") +
    ggtitle(str_glue("GDP vs CO2 emissions per capita in ", maxi$Year))
ggplotly(fig, tooltip = "text")

Open section

1. Relationship between continent and energy use

General visual

Let’s get a general idea of what we’re looking at.

data_energy <- data %>%
    select(continent, energy) %>%
    drop_na()
ggplot(data_energy, mapping = aes(energy, color = continent)) +
    geom_freqpoly(binwidth = 500)

It looks like we’ve got peaks at the same location for Africa, the Americas and Asia. Europe and Oceania are doing their own thing.

The next question is can we perform ANOVA and statistically determine relationships in the data. First, we’ll check the assumptions.

Normality

Let’s observe how normal the data is.

ggplot(data_energy, mapping = aes(sample = energy)) +
    facet_grid(continent ~ .) +
    stat_qq() +
    stat_qq_line()

Africa, Oceania and Europe look approximately normal, but the others do not. Let’s apply a data transformation. We get an empty tibble when filtering for energy values less than or equal to zero. So, it is safe to apply the logarithm to the energy usage data.

data_energy %>%
    filter(energy <= 0)
## # A tibble: 0 x 2
## # … with 2 variables: continent <chr>, energy <dbl>

Let’s try the data transformation.

data_energy_t <- data_energy %>%
    mutate(energy_transformed = log(energy))
ggplot(data_energy_t, mapping = aes(sample = energy_transformed)) +
    facet_grid(continent ~ .) +
    stat_qq() +
    stat_qq_line()

The data look more normal now. Oceania remains normal. The other groups’ data are still not perfect, but it’s okay since we have a relatively large sample size for them.

data_energy_t %>%
    group_by(continent) %>%
    summarise(count = n())
## # A tibble: 5 x 2
##   continent count
##   <chr>     <int>
## 1 Africa      199
## 2 Americas    188
## 3 Asia        185
## 4 Europe      256
## 5 Oceania      20

Independence

The data come from different continents and are at least 5 years apart. Assume the data are independent due to this space and time separation.

Equality of variances

How do the variances look?

ggplot(data_energy_t) +
    geom_boxplot(mapping = aes(energy_transformed, continent)) +
    xlab("Log of energy usage (kg of oil equivalent per capita)") +
    ylab("Continent")

The variances of the transformed data look roughly the same. Oceania stands out as having a smaller variance than the others. Asia looks like it has a larger variance. Despite this, we will continue.

ANOVA

test_results <- aov(energy_transformed ~ continent, data = data_energy_t)
summary.aov(test_results)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## continent     4  342.8   85.70   117.7 <2e-16 ***
## Residuals   843  613.9    0.73                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than 0.001. In our assumptions, we saw that the transformed data were not exactly normal, and their variances were not exactly equal. Taking that into account, I still think we can reject the null hypothesis. That is, we can reject that the log of the group means are equal. Since the logarithm is a monotone increasing function, we can reject that the group means are equal.

Post hoc testing via Tukey’s

TukeyHSD(test_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = energy_transformed ~ continent, data = data_energy_t)
## 
## $continent
##                         diff        lwr       upr     p adj
## Americas-Africa   0.62609440  0.3888318 0.8633570 0.0000000
## Asia-Africa       0.53390365  0.2956539 0.7721534 0.0000000
## Europe-Africa     1.61347063  1.3930062 1.8339350 0.0000000
## Oceania-Africa    1.96990907  1.4226918 2.5171263 0.0000000
## Asia-Americas    -0.09219075 -0.3337751 0.1493937 0.8351479
## Europe-Americas   0.98737623  0.7633124 1.2114401 0.0000000
## Oceania-Americas  1.34381467  0.7951374 1.8924920 0.0000000
## Europe-Asia       1.07956698  0.8544581 1.3046759 0.0000000
## Oceania-Asia      1.43600542  0.8869005 1.9851103 0.0000000
## Oceania-Europe    0.35643844 -0.1851867 0.8980636 0.3747658

For every pair except for (Asia, Americas) and (Oceania, Europe), we reject the null hypothesis that the logarithm of the energy usage means are the same.

2. Difference of imports between Europe and Asia

3. Country with the highest average population density in 1962-2007

data_popden <- data %>%
    group_by(`Country Name`) %>%
    select(`Country Name`, popden) %>%
    summarise(avg_popden = mean(popden, na.rm = TRUE)) %>%
    arrange(desc(avg_popden))
num_countries_shown <- 20
ggplot(data = head(data_popden, n = num_countries_shown)) +
    geom_bar(
        mapping = aes(x = avg_popden, y = reorder(`Country Name`, avg_popden)),
        stat = "identity") +
    xlab("Average population density (people per sq. km of land)") +
    ylab("") +
    ggtitle(str_glue(num_countries_shown, " most population dense countries 1962-2007"))

The country with the highest average population density between 1962 and 2007 is Macao SAR, China.

4. Country with the greatest increase in life expectancy at birth since 1962

data_lifeExp <- data %>%
    select(`Country Name`, Year, lifeExp) %>%
    drop_na() %>%
    group_by(`Country Name`) %>%
    filter(any(Year == 1962)) %>%
    mutate(lifeExpSince1962 = lifeExp - lifeExp[Year == 1962]) %>%
    ungroup()

countries_highest <- data_lifeExp %>%
    group_by(`Country Name`) %>%
    mutate(lifeExpTotalChange = lifeExp[Year == 2007] - lifeExp[Year == 1962]) %>%
    summarise(lifeExpChange = max(lifeExpTotalChange)) %>%
    arrange(desc(lifeExpChange))

cutoff_lifeExpChange <- min(head(countries_highest, n = 10)$lifeExpChange)

data_lifeExpTop <- data_lifeExp %>%
    group_by(`Country Name`) %>%
    filter(lifeExp[Year == 2007] - lifeExp[Year == 1962] >= cutoff_lifeExpChange)

fig <- ggplot(data = data_lifeExpTop) + 
    geom_line(mapping = aes(
        x = Year,
        y = lifeExpSince1962,
        color = `Country Name`)
    ) +
    ylab("Life expectancy at birth since 1962 (years)")
ggplotly(fig)

The Maldives had the highest increase in life expectancy at birth from 1962 to 2007.